Studying crime data is crucial for understanding patterns, trends, and possible interventions to ensure public safety and security. In this study, we analyze the 2023 policing data from Colchester alongside weather information, with the goal of revealing insights and creating engaging visualizations of the data. The Colchester policing dataset provides a wealth of information on crime incidents, including their categories, locations, and outcomes. By examining it, we can learn about crime trends in the area, the factors influencing crime rates, and possible approaches to prevention and enforcement.
Furthermore, we include weather information in our study to investigate the relationship between weather patterns and criminal activity in Colchester. Understanding the potential influence of weather on crime rates can offer valuable insight for law enforcement agencies and urban planners designing effective crime-prevention strategies.
Our main goal is to conduct a thorough, visualization-driven analysis of the 2023 Colchester policing dataset, enriched with weather data. Specifically, we investigate where and when crime incidents occur, uncover patterns, trends, and relationships in the data, demonstrate varied ways of visualizing and presenting the information, and suggest further analysis and actions based on our findings.
The report is designed to lead readers through our analysis procedure, beginning with data preparation and basic visualizations for each dataset, moving on to advanced visualizations and interpretation of the results.
By combining statistical analysis, data visualization methods, and narrative storytelling, the goal of this project is to provide a detailed insight into the patterns of crime in Colchester and investigate how weather conditions impact crime rates. Let’s start this adventure of exploring and discovering using data visualization as our guide.
In this section, we will describe the process of loading and preprocessing the policing dataset from Colchester in 2023, as well as the weather dataset. Proper data preparation is crucial for ensuring the accuracy and reliability of our analysis.
We will start by loading the policing dataset and the weather dataset into our R environment. The policing dataset contains information about crime incidents, including categories, locations, dates, and outcomes. The weather dataset provides information about weather conditions such as temperature, precipitation, and wind speed.
temp_data <- read.csv('temp2023.csv')
crime_data <- read.csv('crime23.csv')
str(temp_data)
## 'data.frame': 365 obs. of 18 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : chr "2023-12-31" "2023-12-30" "2023-12-29" "2023-12-28" ...
## $ TemperatureCAvg: num 8.7 6.6 9.9 9.9 5.8 9.8 12.5 10 9.6 10 ...
## $ TemperatureCMax: num 10.6 9.7 11.4 11.5 10.6 12.7 14.3 12 10.8 12.6 ...
## $ TemperatureCMin: num 4.4 4.4 6.9 4 3.9 6.3 9.5 8.4 8.1 8.1 ...
## $ TdAvgC : num 7.2 4.2 6 7.5 3.7 7.6 10.1 7 6.5 6.2 ...
## $ HrAvg : num 89.6 85.5 77.2 84.6 86.4 86.9 85.3 81.5 81.2 78.2 ...
## $ WindkmhDir : chr "S" "WSW" "SW" "SSW" ...
## $ WindkmhInt : num 25 22.7 32.8 32.2 13.2 23.5 34.1 32.7 34.1 37.5 ...
## $ WindkmhGust : num 63 50 61.2 70.4 37.1 46.3 72.3 61.2 68.6 77.8 ...
## $ PresslevHp : num 999 1007 1004 1003 1016 ...
## $ Precmm : num 6.2 0.4 0.8 2.8 2 4.4 0.8 0.8 0 2 ...
## $ TotClOct : num 8 4.6 6.5 6.8 4 6.5 7.8 5 8 7.5 ...
## $ lowClOct : num 8 6.5 6.7 7.1 6.9 7.4 7.8 6.7 8 7.5 ...
## $ SunD1h : num 0 1.1 0.1 0 3.2 0 0 2.9 0 1.4 ...
## $ VisKm : num 26.3 48.3 26.7 25.1 30.1 45.8 61.8 72.9 69.4 34.3 ...
## $ PreselevHp : logi NA NA NA NA NA NA ...
## $ SnowDepcm : int NA NA NA NA NA NA NA NA NA NA ...
str(crime_data)
## 'data.frame': 6878 obs. of 12 variables:
## $ category : chr "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
## $ persistent_id : chr "" "" "" "" ...
## $ date : chr "2023-01" "2023-01" "2023-01" "2023-01" ...
## $ lat : num 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num 0.909 0.902 0.898 0.902 0.895 ...
## $ street_id : int 2153366 2153173 2153077 2153186 2153012 2153379 2153105 2153541 2152937 2153107 ...
## $ street_name : chr "On or near Military Road" "On or near " "On or near Culver Street West" "On or near Ryegate Road" ...
## $ context : logi NA NA NA NA NA NA ...
## $ id : int 107596596 107596646 107595950 107595953 107595979 107595985 107596603 107596291 107596305 107596453 ...
## $ location_type : chr "Force" "Force" "Force" "Force" ...
## $ location_subtype: chr "" "" "" "" ...
## $ outcome_status : chr NA NA NA NA ...
The weather dataset contains 365 observations of 18 variables. These variables include station_ID, Date, TemperatureCAvg, TemperatureCMax, TemperatureCMin, TdAvgC, HrAvg, WindkmhDir, WindkmhInt, WindkmhGust, PresslevHp, Precmm, TotClOct, lowClOct, SunD1h, VisKm, PreselevHp, and SnowDepcm.
The policing dataset contains 6878 observations of 12 variables. These variables include category, persistent_id, date, lat, long, street_id, street_name, context, id, location_type, location_subtype, and outcome_status.
In the first step of the Data Preparation, we check whether the datasets contain missing values and whether any columns hold incomplete or uninformative data.
crime_data[crime_data == ""] <- NA
temp_data[temp_data == ""] <- NA
crime_missing <- is.na(crime_data)
temp_missing <- is.na(temp_data)
counts_crime_missing <- colSums(crime_missing)
counts_temp_missing <- colSums(temp_missing)
crime_na_counts_df <- data.frame(t(counts_crime_missing))
temp_na_counts_df <- data.frame(t(counts_temp_missing))
# Remove columns with a missing-value count of 0 from both summary tables
zero_cols <- colSums(crime_na_counts_df == 0, na.rm = TRUE) == nrow(crime_na_counts_df)
crime_na_counts_df <- crime_na_counts_df[, !zero_cols]
zero_cols <- colSums(temp_na_counts_df == 0, na.rm = TRUE) == nrow(temp_na_counts_df)
temp_na_counts_df <- temp_na_counts_df[, !zero_cols]
kable(crime_na_counts_df, caption = "Crime Dataset Missing Values")
| persistent_id | context | location_subtype | outcome_status |
|---|---|---|---|
| 701 | 6878 | 6854 | 677 |
kable(temp_na_counts_df, caption = "Weather Dataset Missing Values")
| Precmm | lowClOct | SunD1h | PreselevHp | SnowDepcm |
|---|---|---|---|---|
| 27 | 13 | 82 | 365 | 364 |
Hence, from these tables we can see that both datasets contain missing values, which we must handle so that they do not interfere with further processing of the data.
From the Missing Value Summary table for the crime data we can make the following observations: context is missing for every record and location_subtype for almost all, so both columns carry no useful information and are dropped; persistent_id (701 missing) and id are identifiers we do not need for the analysis, so they are dropped too; the 677 missing outcome_status values are relabelled as "No Outcome".
library(dplyr)
crime_data <- crime_data %>%
select(-context, -location_subtype) %>%
select(-persistent_id, -id)
crime_data$outcome_status[is.na(crime_data$outcome_status)] <- "No Outcome"
sum(is.na(crime_data))
## [1] 0
str(crime_data)
## 'data.frame': 6878 obs. of 8 variables:
## $ category : chr "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" "anti-social-behaviour" ...
## $ date : chr "2023-01" "2023-01" "2023-01" "2023-01" ...
## $ lat : num 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num 0.909 0.902 0.898 0.902 0.895 ...
## $ street_id : int 2153366 2153173 2153077 2153186 2153012 2153379 2153105 2153541 2152937 2153107 ...
## $ street_name : chr "On or near Military Road" "On or near " "On or near Culver Street West" "On or near Ryegate Road" ...
## $ location_type : chr "Force" "Force" "Force" "Force" ...
## $ outcome_status: chr "No Outcome" "No Outcome" "No Outcome" "No Outcome" ...
From the Missing Value Summary table for the weather data we can make the following observations: PreselevHp is missing for every day and SnowDepcm for all but one, so both columns are dropped; the missing values in Precmm, lowClOct and SunD1h most plausibly correspond to days with no precipitation, no low cloud and no recorded sunshine, so they are replaced with 0.
temp_data <- temp_data %>%
select(-PreselevHp, -SnowDepcm)
temp_data$Precmm[is.na(temp_data$Precmm)] <- 0
temp_data$lowClOct[is.na(temp_data$lowClOct)] <- 0
temp_data$SunD1h[is.na(temp_data$SunD1h)] <- 0
Now that we have cleaned both datasets, we turn to their data types.
In the weather data, the dates are stored as character strings and need to be converted to type Date:
temp_data$Date <- as.Date(temp_data$Date)
Next, we identify the qualitative variables in the weather dataset and convert them into factors. Here, WindkmhDir is the only qualitative variable (the rest are numeric), so we convert it from character to factor:
temp_data$WindkmhDir <- factor(temp_data$WindkmhDir)
str(temp_data)
## 'data.frame': 365 obs. of 16 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : Date, format: "2023-12-31" "2023-12-30" ...
## $ TemperatureCAvg: num 8.7 6.6 9.9 9.9 5.8 9.8 12.5 10 9.6 10 ...
## $ TemperatureCMax: num 10.6 9.7 11.4 11.5 10.6 12.7 14.3 12 10.8 12.6 ...
## $ TemperatureCMin: num 4.4 4.4 6.9 4 3.9 6.3 9.5 8.4 8.1 8.1 ...
## $ TdAvgC : num 7.2 4.2 6 7.5 3.7 7.6 10.1 7 6.5 6.2 ...
## $ HrAvg : num 89.6 85.5 77.2 84.6 86.4 86.9 85.3 81.5 81.2 78.2 ...
## $ WindkmhDir : Factor w/ 16 levels "E","ENE","ESE",..: 9 16 13 12 13 16 16 16 14 15 ...
## $ WindkmhInt : num 25 22.7 32.8 32.2 13.2 23.5 34.1 32.7 34.1 37.5 ...
## $ WindkmhGust : num 63 50 61.2 70.4 37.1 46.3 72.3 61.2 68.6 77.8 ...
## $ PresslevHp : num 999 1007 1004 1003 1016 ...
## $ Precmm : num 6.2 0.4 0.8 2.8 2 4.4 0.8 0.8 0 2 ...
## $ TotClOct : num 8 4.6 6.5 6.8 4 6.5 7.8 5 8 7.5 ...
## $ lowClOct : num 8 6.5 6.7 7.1 6.9 7.4 7.8 6.7 8 7.5 ...
## $ SunD1h : num 0 1.1 0.1 0 3.2 0 0 2.9 0 1.4 ...
## $ VisKm : num 26.3 48.3 26.7 25.1 30.1 45.8 61.8 72.9 69.4 34.3 ...
In the crime data we have the following qualitative variables:

- category
- street_id
- street_name
- location_type
- outcome_status
So, we are converting all of these into factors:
crime_data$category <- factor(crime_data$category)
crime_data$street_id <- factor(crime_data$street_id)
crime_data$street_name <- factor(crime_data$street_name)
crime_data$location_type <- factor(crime_data$location_type)
crime_data$outcome_status <- factor(crime_data$outcome_status)
str(crime_data)
## 'data.frame': 6878 obs. of 8 variables:
## $ category : Factor w/ 14 levels "anti-social-behaviour",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ date : chr "2023-01" "2023-01" "2023-01" "2023-01" ...
## $ lat : num 51.9 51.9 51.9 51.9 51.9 ...
## $ long : num 0.909 0.902 0.898 0.902 0.895 ...
## $ street_id : Factor w/ 375 levels "2152702","2152722",..: 254 178 133 185 98 257 148 305 63 149 ...
## $ street_name : Factor w/ 351 levels "Colchester Town (station)",..: 206 2 93 265 196 186 2 124 305 323 ...
## $ location_type : Factor w/ 2 levels "BTP","Force": 2 2 2 2 2 2 2 2 2 2 ...
## $ outcome_status: Factor w/ 14 levels "Action to be taken by another organisation",..: 9 9 9 9 9 9 9 9 9 9 ...
This concludes our Data Preparation stage: we have successfully cleaned both datasets by handling the missing values and removing columns with little or no data or relevance, and we have converted the qualitative variables into factors, which will help in our further analysis.
In the crime23 dataset we have a total of 6878 observations of 8 variables after cleaning. Each entry is tagged only with the month in which the crime occurred, so the finest temporal granularity at which we can observe changes in the data is monthly.
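As an aside, the "YYYY-MM" strings can also be parsed into full Date objects if finer date arithmetic is ever needed; a minimal sketch (illustrative only, the analysis below simply extracts the month with substr()):

```r
# Parse "YYYY-MM" strings into Date objects by appending a day-of-month
month_strings <- c("2023-01", "2023-02", "2023-12")  # sample values in the dataset's format
month_dates <- as.Date(paste0(month_strings, "-01")) # first day of each month
month_dates
```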
To start our exploration of the crime data, we look at its temporal behaviour, in other words how crime counts change from month to month. We begin by plotting the total number of crimes per month in 2023 as a bar plot.
crime_df <- crime_data
temp_df <- temp_data
# Extract the month from the date strings
crime_data$month <- substr(crime_data$date, 6, 7)
# Count the number of crimes per month for the year 2023
crime_count <- table(crime_data$month)
# Plot the count of total crimes by month as a bar plot
barplot(crime_count,
main = "Total Crimes by Month in 2023",
xlab = "Month",
ylab = "Total Crimes",
col = "skyblue",
ylim = c(0, max(crime_count) + 100),
names.arg = month.abb)
Next, we can look at the crime categories and their occurrences by month. A stacked bar plot helps us see how the mix of crime categories varies across the months.
crime_data$month <- factor(crime_data$month)
months <- levels(crime_data$month)
crime_table <- table(crime_data$category, crime_data$month)
kable(crime_table, caption = "Crime Count by Month")
| 01 | 02 | 03 | 04 | 05 | 06 | 07 | 08 | 09 | 10 | 11 | 12 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| anti-social-behaviour | 46 | 49 | 21 | 53 | 67 | 52 | 76 | 71 | 90 | 68 | 39 | 45 |
| bicycle-theft | 20 | 14 | 19 | 16 | 16 | 14 | 15 | 21 | 37 | 26 | 27 | 10 |
| burglary | 17 | 22 | 14 | 22 | 15 | 26 | 14 | 20 | 18 | 31 | 11 | 15 |
| criminal-damage-arson | 59 | 37 | 52 | 63 | 64 | 42 | 42 | 33 | 47 | 45 | 53 | 44 |
| drugs | 14 | 17 | 21 | 21 | 22 | 15 | 17 | 7 | 25 | 19 | 13 | 17 |
| other-crime | 7 | 5 | 6 | 15 | 3 | 11 | 12 | 9 | 7 | 6 | 5 | 6 |
| other-theft | 48 | 37 | 35 | 38 | 42 | 41 | 51 | 41 | 34 | 49 | 37 | 38 |
| possession-of-weapons | 3 | 3 | 11 | 5 | 7 | 3 | 8 | 5 | 8 | 6 | 8 | 7 |
| public-order | 45 | 42 | 58 | 51 | 37 | 36 | 40 | 41 | 45 | 52 | 45 | 40 |
| robbery | 8 | 7 | 8 | 7 | 7 | 17 | 6 | 5 | 8 | 9 | 5 | 7 |
| shoplifting | 76 | 31 | 51 | 40 | 51 | 59 | 33 | 57 | 33 | 43 | 39 | 41 |
| theft-from-the-person | 6 | 7 | 12 | 7 | 5 | 6 | 9 | 5 | 7 | 3 | 4 | 5 |
| vehicle-crime | 65 | 15 | 21 | 29 | 24 | 45 | 25 | 16 | 20 | 26 | 56 | 64 |
| violent-crime | 237 | 181 | 226 | 207 | 226 | 196 | 236 | 219 | 263 | 209 | 221 | 212 |
library(ggplot2)
# Create a data frame from the provided table
crime_data_plot <- data.frame(
crime = rep(row.names(crime_table), times = ncol(crime_table)),
month = rep(colnames(crime_table), each = nrow(crime_table)),
count = as.vector(crime_table)
)
custom_colors <- c("#E41A1C", "#377EB8", "#4DAF4A", "#984EA3", "#FF7F00", "#FFFF33", "#A65628", "#F781BF", "#999999", "#66C2A5", "#FC8D62", "#8DA0CB", "#E78AC3", "skyblue")
# Plot stacked bar chart
ggplot(crime_data_plot, aes(x = month, y = count, fill = crime)) +
geom_bar(stat = "identity") +
scale_fill_manual(values = custom_colors) +
labs(title = "Stacked Bar Plot of Crime Types Over Months",
x = "Month",
y = "Count") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
From the above plot we can observe a peak in the total crime count in January and a minimum in February. We can also observe that in every month the most frequent category is “violent-crime”, while the lowest counts belong to “theft-from-the-person”, “robbery” and “possession-of-weapons”.
Next, we look at the total count of crimes committed in each category, using a pie chart to visualise the data.
crime_2 <- rowSums(crime_table)
crime_3 <- data.frame(crime_2)
colnames(crime_3) <- "Count"
kable(crime_3, caption="Crimes by Category")
| Count | |
|---|---|
| anti-social-behaviour | 677 |
| bicycle-theft | 235 |
| burglary | 225 |
| criminal-damage-arson | 581 |
| drugs | 208 |
| other-crime | 92 |
| other-theft | 491 |
| possession-of-weapons | 74 |
| public-order | 532 |
| robbery | 94 |
| shoplifting | 554 |
| theft-from-the-person | 76 |
| vehicle-crime | 406 |
| violent-crime | 2633 |
# Create a data frame from the provided table
crime_data <- data.frame(
crime = names(crime_2),
count = as.vector(crime_2)
)
total <- sum(crime_data$count)
crime_data$percentage <- crime_data$count/total *100
# Plot pie chart
ggplot(crime_data, aes(x = "", y = count, fill = crime)) +
geom_bar(stat = "identity", width = 2) +
coord_polar("y", start = 0) +
scale_fill_manual(values = custom_colors) +
labs(title = "Pie Chart of Crime Types",
fill = "Crime Type") +
theme_void() +
theme(legend.position = "right", legend.box.margin = margin(1, 1, 1, 1, "cm")) +
geom_text(aes(x = 2.25, label=paste0(round(percentage, 1), "%")),
position = position_stack(vjust = 0.5),
size=4)
From this plot we can observe that in the year 2023 the most frequent crime category is “violent-crime” (38.3%), while the least frequent are “theft-from-the-person” (1.1%), “possession-of-weapons” (1.1%), “other-crime” (1.3%) and “robbery” (1.4%).
Now that we have an understanding of the counts of the crime categories, we can examine whether there is a discernible pattern between the crime category and the outcome_status.
Based on the outcome of each crime, we can visualise the most common outcome types among the crimes committed in Colchester.
crime_data <- crime_df
library(plotly)
# Create a table of outcome_status
outcome_status_table <- table(crime_data$outcome_status)
# Convert the table to a data frame
outcome_status_data <- data.frame(outcome_status = names(outcome_status_table),
count = as.vector(outcome_status_table))
# Create the pie chart
pie_chart <- plot_ly(outcome_status_data, labels = ~outcome_status, values = ~count, type = "pie") %>%
layout(title = "Pie Chart of Crime Outcome Status")
# Print the pie chart
pie_chart
From this plot, we can see that the majority of the crimes committed have one of the following outcomes:

- Investigation complete, no suspect identified
- Unable to prosecute suspect
- No outcome

Together these constitute almost 77% of the reported crimes in Colchester and warrant attention from law-enforcement officials.
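The ~77% figure can be checked directly from the outcome counts; a minimal sketch, assuming the crime_data prepared above:

```r
# Combined share of the three most frequent outcome statuses
outcome_counts <- sort(table(crime_data$outcome_status), decreasing = TRUE)
top3_share <- sum(outcome_counts[1:3]) / sum(outcome_counts) * 100
round(top3_share, 1)
```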
In order to understand how the crime outcome varies across the different crime categories, we plot the count of each crime category for each outcome_status.
crime_data <- crime_df
outcome <- levels(crime_data$outcome_status)
outcome_table <- table(crime_data$category, crime_data$outcome_status)
#kable(outcome_table, caption = "Outcome by Crimes")
# Create a data frame from the provided table
outcome_data_plot <- data.frame(
crime = rep(row.names(outcome_table), times = ncol(outcome_table)),
outcome = rep(colnames(outcome_table), each = nrow(outcome_table)),
count = as.vector(outcome_table)
)
library(plotly)
# Convert crime column to factor
outcome_data_plot$crime <- as.factor(outcome_data_plot$crime)
# Plot the stacked bar chart
plot <- ggplot(outcome_data_plot, aes(x = outcome, y = count, fill = crime)) +
geom_bar(stat = "identity") +
labs(title = "Stacked Bar Plot of Crime Types Over Outcomes",
x = "Outcome",
y = "Count") +
theme(axis.text.x = element_text(angle = 25, hjust = 1))
# Convert ggplot object to plotly
plotly_plot <- ggplotly(plot)
# Print the interactive plot
plotly_plot
From this plot we can see that all the NA values we relabelled as “No Outcome” belong to the category “anti-social-behaviour”. We can also observe that the majority of reported crimes end in “Investigation complete, no suspect identified” or “Unable to prosecute suspect”, and that most of the crimes still under investigation are of the category violent-crime.
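The first observation can be verified directly from the data; a minimal sketch, assuming the cleaned crime_data:

```r
# Cross-check: which categories do the records relabelled "No Outcome" belong to?
no_outcome <- crime_data[crime_data$outcome_status == "No Outcome", ]
table(droplevels(no_outcome$category))
```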
We can visualise how the different crime categories are distributed geographically by plotting them on a map and colour-coding each location by the category of the crime that happened there.
library(osmdata)
library(leaflet)
library(sf)
library(leaflet.extras)
# Convert category to factor
crime_df$category <- factor(crime_df$category)
# Define a color palette for the categories
category_palette <- colorFactor(palette = "Set1", domain = levels(crime_df$category))
# Create a leaflet map
map <- leaflet() %>%
addTiles() %>%
setView(lng = mean(crime_df$long), lat = mean(crime_df$lat), zoom = 14)
# Add clustered markers, color by category
map <- map %>%
addCircleMarkers(data = crime_df,
lng = ~long, lat = ~lat,
color = ~category_palette(category),
fillOpacity = 0.7, # Adjust transparency
stroke = FALSE, # Remove marker borders
radius = 5, # Adjust marker size
popup = ~paste("Latitude:", lat, "<br>Longitude:", long))
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
# Get unique categories and their corresponding colors
unique_categories <- levels(crime_df$category)
category_colors <- category_palette(unique_categories)
## Warning in RColorBrewer::brewer.pal(max(3, n), palette): n too large, allowed maximum for palette Set1 is 9
## Returning the palette you asked for with that many colors
# Create legend
legend <- addLegend(map = map, position = "bottomright", colors = category_colors,
labels = unique_categories, title = "Crime Category")
# Display the map with legend
legend
We can also take a look at how crime counts are distributed across street_ids to understand the spread of crime in Colchester. We can use a density plot to visualise this:
crime_data <- data.frame(
street_id = levels(crime_df$street_id),
crime_count = as.numeric(table(crime_df$street_id))
)
# Plot density plot
ggplot(crime_data, aes(x = crime_count)) +
geom_density(fill = "skyblue", color = "blue") +
labs(title = "Density Plot of Crime Count by Street ID",
x = "Crime Count",
y = "Density") +
theme_minimal()
density_values <- density(crime_data$crime_count)$y
# Find the maximum density value (peak) and corresponding crime count
max_density <- max(density_values)
peak_crime_count <- crime_data$crime_count[which.max(density_values)]
peak_crime_count
## [1] 7
max_density
## [1] 0.04773014
max(crime_data$crime_count)
## [1] 241
median(crime_data$crime_count)
## [1] 9
From this plot and the summary statistics above we can make the following observations: the distribution of crime counts per street is strongly right-skewed; the density peaks at a count of about 7 and the median is 9, yet the maximum is 241, indicating that a small number of streets account for a disproportionately large share of the reported crime.
In order to further investigate these outliers, that is, streets with exceptionally high reported crime, we can use a box plot and filter out the street_ids whose crime counts exceed the usual interquartile-range threshold.
# Calculate quartiles and interquartile range
Q1 <- quantile(crime_data$crime_count, 0.25)
Q3 <- quantile(crime_data$crime_count, 0.75)
IQR <- Q3 - Q1
# Define upper and lower bounds for outliers
upper_bound <- Q3 + 1.5 * IQR
# Filter street IDs with exceptionally high crime counts
outliers <- crime_data$street_id[crime_data$crime_count > upper_bound]
# Filter crime_data to exclude outliers
non_outlier_counts <- crime_data[!crime_data$street_id %in% outliers, ]
par(mfrow = c(1, 2))
# Create box plot without ggplot
boxplot(non_outlier_counts$crime_count,
main = "Box Plot (without Outliers)",
ylab = "Crime Count",
col = "skyblue",
border = "blue",
boxwex = 0.5,
outline = TRUE)
boxplot(crime_data$crime_count,
main = "Box Plot (with Outliers)",
ylab = "Crime Count",
col = "red",
border = "blue",
boxwex = 0.5, # Adjust the width of the box
outline = TRUE)
Let's take a look at the street_ids flagged as outliers:
outliers
## [1] "2152969" "2153000" "2153012" "2153014" "2153018" "2153025" "2153051"
## [8] "2153077" "2153092" "2153105" "2153107" "2153111" "2153123" "2153130"
## [15] "2153155" "2153158" "2153173" "2153180" "2153197" "2153213" "2153227"
## [22] "2153232" "2153238" "2153240" "2153318" "2153373" "2153436" "2153443"
## [29] "2153520" "2153541" "2153630"
len_out <- length(outliers)
outlier_counts <- crime_data[crime_data$street_id %in% outliers, ]
outlier_crimes <- sum(outlier_counts$crime_count)
totalcrimes <- nrow(crime_df)
crime_p <- outlier_crimes/totalcrimes * 100
crime_p
## [1] 40.14248
len_street <- length(levels(crime_df$street_id))
street_p <- len_out/len_street *100
street_p
## [1] 8.266667
From the above analysis we can clearly see that about 40% of the reported crimes take place in these 31 street_ids, which make up only 8.27% of all street_ids. This shows that crime in Colchester is highly concentrated on a few streets, while the majority of streets have very low crime counts.
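This concentration can also be expressed as a Pareto-style summary; a minimal sketch, assuming crime_df as prepared earlier:

```r
# Sort streets by crime count and compute the share of all crime
# accounted for by the busiest 10% of streets
street_counts <- sort(table(crime_df$street_id), decreasing = TRUE)
cum_share <- cumsum(as.numeric(street_counts)) / sum(street_counts) * 100
top10pct <- ceiling(0.10 * length(street_counts))
round(cum_share[top10pct], 1)
```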
Now let's take a look at the categories of crime committed in these 31 street_ids:
outlier_df <- crime_df[crime_df$street_id %in% outliers, ]
crime_count <- table(outlier_df$category)
tbl1 <- crime_count
tbl2 <- crime_2
merged_df <- cbind(tbl1, tbl2)
colnames(merged_df) <- c("outlier", "total")
#print(merged_df)
comparison_df <- data.frame(merged_df)
perc_df <- data.frame(comparison_df$outlier/comparison_df$total *100)
row.names(perc_df) <- row.names(comparison_df)
colnames(perc_df) <- c("Percentage of crimes in outlier streets")
kable(perc_df, caption = "Percentages of crime categories in the outlier streets")
| Percentage of crimes in outlier streets | |
|---|---|
| anti-social-behaviour | 41.50665 |
| bicycle-theft | 55.74468 |
| burglary | 22.22222 |
| criminal-damage-arson | 33.56282 |
| drugs | 45.19231 |
| other-crime | 23.91304 |
| other-theft | 38.28921 |
| possession-of-weapons | 39.18919 |
| public-order | 44.17293 |
| robbery | 36.17021 |
| shoplifting | 83.39350 |
| theft-from-the-person | 64.47368 |
| vehicle-crime | 19.21182 |
| violent-crime | 34.67528 |
From the above table we can observe that a significant share of several crime categories occurs in these 31 outlier street_ids. Most strikingly, 83.4% of shoplifting, 64.5% of theft-from-the-person and 55.7% of bicycle-theft incidents are reported from these streets.
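The most concentrated categories can be pulled out of the table programmatically; a minimal sketch, assuming the perc_df built in the chunk above:

```r
# Rank crime categories by their concentration in the outlier streets
conc <- setNames(perc_df[[1]], row.names(perc_df))
head(round(sort(conc, decreasing = TRUE), 1), 3)
```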
Having established that these street_ids account for a disproportionately high share of almost every recorded crime type, let's visualise the crimes reported from these locations on a map alongside the crimes reported elsewhere, so we can see how the locations are distributed geographically.
library(osmdata)
library(leaflet)
library(sf)
library(leaflet.extras)
non_outlier_df <- crime_df[!crime_df$street_id %in% outliers, ]
crime_sf <- st_as_sf(non_outlier_df, coords = c("long", "lat"), crs = 4326)
# Create a leaflet map
map <- leaflet() %>%
addTiles() %>%
setView(lng = mean(non_outlier_df$long), lat = mean(non_outlier_df$lat), zoom = 14)
# Add clustered markers
map <- map %>%
addCircleMarkers(data = outlier_df,
lng = ~long, lat = ~lat,
color = "red",
radius = 5,
popup = ~paste("Latitude:", lat, "<br>Longitude:", long))
# Add heatmap layer
map <- map %>%
addHeatmap(data = crime_sf, radius = 6)
# Display the map
map
About 40% of the crime reported in 2023 happened at the locations marked with red dots, and the remaining 60% within the blue heatmap patches on the map. This clearly shows that crime occurs in clusters across the town, the biggest cluster being near the High Street in Colchester. These findings could be of interest to law enforcement and could help them reduce overall crime significantly by focussing on these regions of interest.
In the Weather data set we have the following variables:
str(temp_data)
## 'data.frame': 365 obs. of 16 variables:
## $ station_ID : int 3590 3590 3590 3590 3590 3590 3590 3590 3590 3590 ...
## $ Date : Date, format: "2023-12-31" "2023-12-30" ...
## $ TemperatureCAvg: num 8.7 6.6 9.9 9.9 5.8 9.8 12.5 10 9.6 10 ...
## $ TemperatureCMax: num 10.6 9.7 11.4 11.5 10.6 12.7 14.3 12 10.8 12.6 ...
## $ TemperatureCMin: num 4.4 4.4 6.9 4 3.9 6.3 9.5 8.4 8.1 8.1 ...
## $ TdAvgC : num 7.2 4.2 6 7.5 3.7 7.6 10.1 7 6.5 6.2 ...
## $ HrAvg : num 89.6 85.5 77.2 84.6 86.4 86.9 85.3 81.5 81.2 78.2 ...
## $ WindkmhDir : Factor w/ 16 levels "E","ENE","ESE",..: 9 16 13 12 13 16 16 16 14 15 ...
## $ WindkmhInt : num 25 22.7 32.8 32.2 13.2 23.5 34.1 32.7 34.1 37.5 ...
## $ WindkmhGust : num 63 50 61.2 70.4 37.1 46.3 72.3 61.2 68.6 77.8 ...
## $ PresslevHp : num 999 1007 1004 1003 1016 ...
## $ Precmm : num 6.2 0.4 0.8 2.8 2 4.4 0.8 0.8 0 2 ...
## $ TotClOct : num 8 4.6 6.5 6.8 4 6.5 7.8 5 8 7.5 ...
## $ lowClOct : num 8 6.5 6.7 7.1 6.9 7.4 7.8 6.7 8 7.5 ...
## $ SunD1h : num 0 1.1 0.1 0 3.2 0 0 2.9 0 1.4 ...
## $ VisKm : num 26.3 48.3 26.7 25.1 30.1 45.8 61.8 72.9 69.4 34.3 ...
As most of the variables in the weather data are numeric, we start with a correlation matrix to understand how the variables influence each other and whether there is any significant correlation between them.
library(reshape2)
# Calculate the correlation matrix
correlation_matrix <- cor(temp_data[, c("TemperatureCAvg", "TemperatureCMax", "TemperatureCMin",
"TdAvgC", "HrAvg", "WindkmhInt", "WindkmhGust",
"PresslevHp", "Precmm", "TotClOct", "lowClOct",
"SunD1h", "VisKm")])
# Melt the correlation matrix for plotting
melted_correlation <- melt(correlation_matrix)
# Plot correlation heatmap
ggplot(melted_correlation, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "#1a9641", mid = "white", high = "#d7191c",
midpoint = 0, limits = c(-1, 1), name = "Correlation") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap of Numeric Variables",
x = "Variables",
y = "Variables")
From the above matrix we can see strong correlation between certain groups of variables. Let's plot TemperatureCAvg, TemperatureCMax, TemperatureCMin and TdAvgC as time series to understand how these variables vary with time.
library(cowplot)
plot1 <- ggplot(temp_data, aes(x=Date, y = TemperatureCAvg, color = TemperatureCAvg))+
geom_line() +
geom_hline(yintercept = mean(temp_data$TemperatureCAvg, na.rm = TRUE),
color = "blue", linetype = "dashed", size = 1 ) + # Add mean line
geom_smooth(method = "loess", color = "red", size = 1) + # Add smooth line
labs(title = "Average Temperature Over Time",
x = "Date",
y = "Average Temperature",
color = "Temperature") +
theme_minimal()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
plot2 <- ggplot(temp_data, aes(x=Date, y = TemperatureCMax, color = TemperatureCMax))+
geom_line() +
geom_hline(yintercept = mean(temp_data$TemperatureCMax, na.rm = TRUE),
color = "blue", linetype = "dashed", size = 1 ) + # Add mean line
geom_smooth(method = "loess", color = "red", size = 1) + # Add smooth line
labs(title = "Max Temperature Over Time",
x = "Date",
y = "Max Temperature",
color = "Temperature") +
theme_minimal()
plot3 <- ggplot(temp_data, aes(x=Date, y = TemperatureCMin, color = TemperatureCMin))+
geom_line() +
geom_hline(yintercept = mean(temp_data$TemperatureCMin, na.rm = TRUE),
color = "blue", linetype = "dashed", size = 1 ) + # Add mean line
geom_smooth(method = "loess", color = "red", size = 1) + # Add smooth line
labs(title = "Min Temperature Over Time",
x = "Date",
y = "Min Temperature",
color = "Temperature") +
theme_minimal()
plot4 <- ggplot(temp_data, aes(x=Date, y = TdAvgC, color = TdAvgC))+
geom_line() +
geom_hline(yintercept = mean(temp_data$TdAvgC, na.rm = TRUE),
color = "blue", linetype = "dashed", size = 1 ) + # Add mean line
geom_smooth(method = "loess", color = "red", size = 1) + # Add smooth line
labs(title = "Average Dew Point Over Time",
x = "Date",
y = "Avg Dew Point",
color = "Temperature") +
theme_minimal()
combined_plot <- plot_grid(plot1, plot2, plot3, plot4, ncol=2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
combined_plot
# Plot time series for TemperatureCAvg, TemperatureCMax, TemperatureCMin, and TdAvgC
ggplot(temp_data, aes(x = Date)) +
geom_line(aes(y = TemperatureCAvg, color = "TemperatureCAvg"), size = 1) +
geom_line(aes(y = TemperatureCMax, color = "TemperatureCMax"), size = 1) +
geom_line(aes(y = TemperatureCMin, color = "TemperatureCMin"), size = 1) +
geom_line(aes(y = TdAvgC, color = "TdAvgC"), size = 1) +
scale_color_manual(values = c("TemperatureCAvg" = "blue",
"TemperatureCMax" = "red",
"TemperatureCMin" = "green",
"TdAvgC" = "orange")) +
labs(title = "Time Series Plot of Temperature Variables",
x = "Date",
y = "Temperature (°C)",
color = "Variable") +
theme_minimal()
From the plot above we can see that all four variables follow a similar pattern, with largely overlapping peaks and troughs, as intuition would suggest. One relationship worth a closer look is that between the Average Dew Point (TdAvgC) and the Average Temperature, which we examine below.
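The visual similarity between the four series can be quantified with a correlation matrix. The sketch below assumes `temp_data` is loaded as above; `use = "complete.obs"` drops rows with missing values:

```r
# Pairwise Pearson correlations between the four temperature series
round(cor(temp_data[, c("TemperatureCAvg", "TemperatureCMax",
                        "TemperatureCMin", "TdAvgC")],
          use = "complete.obs"), 2)
```

Values close to 1 throughout the matrix would confirm that the series move together, matching the overlapping peaks and troughs in the plot.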
We can also look at how the precipitation variable varies over time:
ggplot(temp_data, aes(x=Date, y = Precmm, color = Precmm))+
geom_line() +
geom_hline(yintercept = mean(temp_data$Precmm, na.rm = TRUE),
color = "seagreen", linetype = "dashed", size = 1 ) + # Add mean line
geom_smooth(method = "loess", color = "red", size = 1) + # Add smooth line
labs(title = "Precipitation Over Time",
x = "Date",
y = "Precipitation",
color = "Precipitation") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Let's visualise scatter plots between all the numeric variables in the weather dataset:
# Load required library
library(ggplot2)
drop_cols <- c("station_ID", "TotClOct", "TemperatureCMax", "TemperatureCMin",
"WindkmhDir", "WindkmhInt", "WindkmhGust", "PreselevHp")
temp_x_data <- temp_data[, !(names(temp_data) %in% drop_cols)]
pairs(temp_x_data, pch = ".")
Analysing the relationship between Average Temperature and Average Dew Point using a scatter plot and then fitting a linear regression model between them using the smoothing function:
library(ggplot2)
library(plotly)
# Create the ggplot scatter plot with linear regression line
scatter_plot <- ggplot(temp_data, aes(x = TemperatureCAvg, y = TdAvgC)) +
geom_point(color = "red") +
geom_smooth(method = "lm", se = TRUE) + # Add linear regression line with confidence interval
labs(title = "Scatter Plot: Average Temperature vs. Average Dew Point with Linear Regression",
x = "Average Temperature (°C)",
y = "Average Dew Point (°C)") +
theme_minimal()
# Convert ggplot to interactive plotly plot
interactive_plot <- ggplotly(scatter_plot)
## `geom_smooth()` using formula = 'y ~ x'
# Print the interactive plot
interactive_plot
The scatter plot and linear regression line demonstrate a positive correlation between average temperature and average dew point, indicating that as temperature increases, so does dew point, reflecting higher moisture levels in warmer conditions. The close clustering of data points around the regression line suggests a strong correlation between temperature and dew point changes, supporting the suitability of the linear model within the observed data range.
Let's also plot HrAvg against TemperatureCMax, which are strongly negatively correlated. HrAvg is the average relative humidity in %.
library(ggplot2)
library(plotly)
# Convert Date column to Date type
temp_data$Date <- as.Date(temp_data$Date)
# Create the plot
plot <- ggplot(temp_data, aes(x = Date)) +
geom_line(aes(y = TemperatureCMax, color = "TemperatureCMax"), size = 1) +
geom_line(aes(y = HrAvg, color = "HrAvg"), size = 1) +
scale_color_manual(values = c("TemperatureCMax" = "red",
"HrAvg" = "green")) +
labs(title = "Time Series Plot of Temperature and Humidity",
x = "Date",
y = "Value (°C / %)",
color = "Variable") +
theme_minimal()
# Convert ggplot object to plotly
plotly_plot <- ggplotly(plot)
# Print the interactive plot
plotly_plot
The time series plot illustrates a strong negative correlation between average relative humidity (HrAvg) and maximum temperature (TemperatureCMax): warmer days coincide with lower humidity levels and cooler days with higher ones, consistent with meteorological principles, since warmer air can hold more moisture and so reaches saturation less readily. Peaks in temperature align with troughs in humidity, and the synchronized fluctuations reveal trends and seasonal cycles, enhancing our understanding of how the two variables move together over time.
To understand the relationship between the various variables in the correlation matrix, we can do a few scatter plots:
Scatter Plot between Average Temperature and Average Relative Humidity
library(plotly)
# Create the scatter plot
scatter_plot <- plot_ly(temp_data, x = ~TemperatureCAvg, y = ~HrAvg, color = I("red")) %>%
add_markers() %>%
layout(title = "Scatter Plot: Average Temperature vs. Average Relative Humidity",
xaxis = list(title = "Average Temperature (°C)"),
yaxis = list(title = "Average Relative Humidity (%)"),
showlegend = FALSE)
# Print the interactive scatter plot
scatter_plot
We can use the smoothing function to add a linear regression line, together with its confidence interval, to the scatter plot. The line is the best-fitting linear relationship between Average Temperature and Average Humidity:
library(ggplot2)
library(plotly)
# Create the ggplot scatter plot with linear regression line
scatter_plot <- ggplot(temp_data, aes(x = TemperatureCAvg, y = HrAvg)) +
geom_point(color = "red") +
geom_smooth(method = "lm", se = TRUE) + # Add linear regression line with confidence interval
labs(title = "Scatter Plot: Average Temperature vs. Average Humidity with Linear Regression",
x = "Average Temperature (°C)",
y = "Average Humidity (%)") +
theme_minimal()
# Convert ggplot to interactive plotly plot
interactive_plot <- ggplotly(scatter_plot)
## `geom_smooth()` using formula = 'y ~ x'
# Print the interactive plot
interactive_plot
The scatter plot and fitted linear regression line suggest a negative correlation between average temperature and average relative humidity, with warmer temperatures associated with lower humidity levels and vice versa. Despite some outliers, the regression line generally fits the data well, indicating a reasonably strong correlation. However, factors like seasonal variations and local weather phenomena may influence the observed pattern, warranting further analysis.
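To attach a number, and a significance test, to this negative relationship, `cor.test()` can be run on the two columns. This is a quick check, assuming the same `temp_data` frame as above:

```r
# Pearson correlation with a t-test of H0: true correlation is zero
ct <- cor.test(temp_data$TemperatureCAvg, temp_data$HrAvg,
               method = "pearson")
ct$estimate  # correlation coefficient
ct$p.value   # significance of the correlation
```

A clearly negative estimate with a small p-value would corroborate the pattern visible in the scatter plot.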
Analyzing the relationship between Average Humidity and Sun Duration in Hours using a scatter plot and then fitting a linear regression model between them:
# Create scatter plot between HrAvg and SunD1h with linear regression line
ggplot(temp_data, aes(x = HrAvg, y = SunD1h)) +
geom_point(color="red") +
geom_smooth(method = "lm", se = TRUE) + # Add linear regression line with confidence interval
labs(title = "Scatter Plot: Average Humidity vs. Sun Duration with Linear Regression",
x = "Average Humidity",
y = "Sun Duration") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot and linear regression line between average humidity (HrAvg) and sun duration in hours (SunD1h) reveal a negative correlation: as humidity increases, sun duration tends to decrease. The regression line fits the data reasonably well, consistent with meteorological principles, since higher humidity levels often coincide with increased cloud cover that obstructs sunlight, though seasonal variations and weather patterns may also influence this relationship. More broadly, this analysis shows that the weather variables are strongly inter-related, so when relating weather to the crime dataset we can restrict attention to a small representative subset of them rather than carrying every weather variable forward.
Since the crime data is reported on a monthly basis, we aggregate the weather data to the same monthly time frame, allowing the two datasets to be merged into a single data frame so that the crime data can be analysed alongside the weather data.
library(dplyr)
library(lubridate)
# Extract the month from each date (the data covers 2023 only,
# so the month alone identifies the period)
temp_data$MonthYear <- month(temp_data$Date)
# Group by MonthYear and calculate monthly medians
monthly_data <- temp_data %>%
group_by(MonthYear) %>%
summarize(
TemperatureCAvg = median(TemperatureCAvg, na.rm = TRUE),
Precmm = median(Precmm, na.rm = TRUE),
TotClOct = median(TotClOct, na.rm = TRUE),
SunD1h = median(SunD1h, na.rm = TRUE))
crime_df$MonthYear <- as.integer(substr(crime_df$date, 6, 7))
merged_data <- merge(crime_df, monthly_data, by = "MonthYear")
# View the merged data
head(merged_data)
## MonthYear category date lat long street_id
## 1 1 anti-social-behaviour 2023-01 51.88306 0.909136 2153366
## 2 1 anti-social-behaviour 2023-01 51.90124 0.901681 2153173
## 3 1 anti-social-behaviour 2023-01 51.88907 0.897722 2153077
## 4 1 anti-social-behaviour 2023-01 51.89122 0.901988 2153186
## 5 1 anti-social-behaviour 2023-01 51.89416 0.895433 2153012
## 6 1 anti-social-behaviour 2023-01 51.88050 0.909014 2153379
## street_name location_type outcome_status TemperatureCAvg
## 1 On or near Military Road Force No Outcome 4.6
## 2 On or near Force No Outcome 4.6
## 3 On or near Culver Street West Force No Outcome 4.6
## 4 On or near Ryegate Road Force No Outcome 4.6
## 5 On or near Market Close Force No Outcome 4.6
## 6 On or near Lisle Road Force No Outcome 4.6
## Precmm TotClOct SunD1h
## 1 0.2 5.1 0
## 2 0.2 5.1 0
## 3 0.2 5.1 0
## 4 0.2 5.1 0
## 5 0.2 5.1 0
## 6 0.2 5.1 0
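Because `merge()` performs an inner join, a quick sanity check is worthwhile: every crime record should have matched a weather month, and all twelve months of 2023 should be present. A sketch, assuming `crime_df` and `merged_data` as built above:

```r
# The join key (month 1-12) exists in both tables for every row,
# so no crime records should have been dropped
stopifnot(nrow(merged_data) == nrow(crime_df))
stopifnot(setequal(unique(merged_data$MonthYear), 1:12))
table(merged_data$MonthYear)  # crime records per month
```

If either assertion failed, it would indicate missing months in the weather summary or malformed dates in the crime data.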
In the next step of our analysis, we will investigate the impact of median Monthly Temperature on the Crime Count:
# Count crimes at each monthly median temperature
crime_count <- merged_data %>%
group_by(TemperatureCAvg) %>%
summarise(crime_count = n())
str(crime_count)
## tibble [12 × 2] (S3: tbl_df/tbl/data.frame)
## $ TemperatureCAvg: num [1:12] 4.6 4.9 5.85 7.5 7.9 8.15 11.7 12.7 16.7 16.8 ...
## $ crime_count : int [1:12] 651 555 467 563 551 574 586 592 584 642 ...
# Create scatter plot between temperature and crime count with linear regression line
ggplot(crime_count, aes(x = TemperatureCAvg, y = crime_count)) +
geom_point(color="red") +
geom_smooth(method = "lm", se = TRUE) + # Add linear regression line with confidence interval
labs(title = "Scatter Plot: Temperature vs. Crime Count with Linear Regression",
x = "Monthly Temperature (°C)",
y = "Crime Count") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot and fitted linear regression line between median monthly temperature (TemperatureCAvg) and crime count reveal a positive correlation: as temperature increases, so does the crime count. Despite some outliers, the regression line fits the data reasonably well, although with only twelve monthly observations the relationship should be interpreted cautiously. Factors like socioeconomic conditions and law enforcement policies may also influence this relationship, warranting further analysis.
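The strength of this relationship can be read off directly from the fitted model rather than judged by eye. A sketch, assuming the `crime_count` tibble built above:

```r
# Fit the same linear model that geom_smooth(method = "lm") draws
fit <- lm(crime_count ~ TemperatureCAvg, data = crime_count)
coef(fit)["TemperatureCAvg"]  # change in crime count per 1 °C (slope)
summary(fit)$r.squared        # share of variance explained
confint(fit)                  # 95% CIs; wide with only 12 points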
In the next step of our analysis, we will investigate the impact of median Monthly Precipitation on the Crime Count:
# Count crimes at each monthly median precipitation value
crime_count <- merged_data %>%
group_by(Precmm) %>%
summarise(crime_count = n())
str(crime_count)
## tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
## $ Precmm : num [1:3] 0 0.2 0.4
## $ crime_count: int [1:3] 5113 1214 551
# Create scatter plot between precipitation and crime count with linear regression line
ggplot(crime_count, aes(x = Precmm, y = crime_count)) +
geom_point(color="red") +
geom_smooth(method = "lm", se = TRUE) + # Add linear regression line with confidence interval
labs(title = "Scatter Plot: Precipitation vs. Crime Count with Linear Regression",
x = "Monthly Precipitation (mm)",
y = "Crime Count") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The scatter plot and fitted linear regression line between median monthly precipitation (Precmm) and crime count suggest a negative correlation: as precipitation increases, crime count tends to decrease. With only three distinct precipitation values, however, the regression is fitted to very few points, so the apparent fit should be treated with considerable caution. Factors like seasonal variations and socioeconomic conditions may also influence this relationship, and further analysis with additional data and variables is needed to understand the dynamics between precipitation and crime count.
In the next step of our analysis, we will investigate the impact of median Monthly Sun Duration on the Crime Count:
# Count crimes at each monthly median sun duration
crime_count <- merged_data %>%
group_by(SunD1h) %>%
summarise(crime_count = n())
str(crime_count)
## tibble [9 × 2] (S3: tbl_df/tbl/data.frame)
## $ SunD1h : num [1:9] 0 2.6 3.3 5.2 6.15 6.2 6.5 7 9.9
## $ crime_count: int [1:9] 2224 563 592 584 642 574 550 586 563
# Create scatter plot between sun duration and crime count with linear regression line
ggplot(crime_count, aes(x = SunD1h, y = crime_count)) +
geom_point(color="red") +
geom_smooth(method = "lm", se = TRUE) + # Add linear regression line with confidence interval
labs(title = "Scatter Plot: Sun Duration vs. Crime Count with Linear Regression",
x = "Sun Duration (hours)",
y = "Crime Count") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
The analysis indicates a negative correlation between median monthly sun duration (SunD1h) and crime count, with longer sun durations associated with lower crime counts. However, the moderate fit of the regression line and the presence of outliers suggest potential variability and other influencing factors. With limited data available, caution is needed in generalizing these findings. Further research incorporating additional data and considering other factors is necessary to better understand the relationship between sun duration and crime count.
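One natural next step along these lines is to put the three weather summaries into a single monthly model rather than examining them one at a time. The sketch below, assuming `merged_data` and `monthly_data` as constructed earlier, counts crimes per month and regresses the count on all three predictors; with only twelve observations it is illustrative rather than conclusive:

```r
library(dplyr)

# One row per month: crime count plus the monthly weather medians
monthly_crime <- merged_data %>%
  count(MonthYear, name = "crime_count") %>%
  left_join(monthly_data, by = "MonthYear")

# Joint linear model; each coefficient shows a variable's association
# with monthly crime while holding the other two fixed
fit_all <- lm(crime_count ~ TemperatureCAvg + Precmm + SunD1h,
              data = monthly_crime)
summary(fit_all)
```

Modelling the predictors jointly helps separate their effects, since temperature, precipitation, and sun duration are themselves correlated across the seasons.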
In our examination of crime statistics in Colchester, we set out to reveal the complex factors that influence the city's crime trends. Through careful data preparation, cleaning, and exploration, we examined the temporal, spatial, and environmental aspects of crime incidents. This process gave us valuable insight into the patterns and trends in criminal activity, as well as the various factors that influence them.
One of the main discoveries from our study was the intricate relationship among different factors impacting the frequency of crime. Moreover, our investigation revealed fluctuations in crime patterns throughout the year, emphasizing the influence of weather on criminal activities. These observations highlight the intricate aspects of crime patterns and stress the necessity of taking various factors into account to comprehend and combat criminal behaviors.
Geospatial analysis was essential in offering a spatial view of crime occurrences in Colchester. By pinpointing clusters of crime incidents in various locations, we were able to identify areas with high concentrations of crime. This knowledge of space provided law enforcement with useful information to improve focused policing and create better crime prevention methods. Furthermore, our examination of outlier street IDs offered important perspectives on how crime is spread across the city, highlighting the importance of targeted interventions and involving the community.
Furthermore, significant relationships between crime counts and weather variables were revealed through correlation and regression analyses. We showed how temperature, rainfall, and other environmental factors can be used to predict criminal behavior using statistical modeling. This ability to predict not only improves our grasp of crime trends but also provides useful information for law enforcement and policymakers to develop proactive strategies against crime.
Looking ahead, it will be essential to maintain cooperation among data scientists, law enforcement agencies, and community stakeholders to create safer and more secure communities. Through utilizing the combined knowledge and assets, we can work towards a future in which insights based on data lead to a society that is more resilient and fair. This report demonstrates how data analytics has the power to address difficult social issues and create positive change. By conducting thorough analysis and making informed decisions, we can strive to create safer and more vibrant communities for future generations.